PM566 HW2

Author

Katherine

library(nycflights13)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
airlines <- airlines
airports <- airports
flights  <- flights
planes   <- planes
weather  <- weather

Question 1

flights %>%
  count(dest, sort = TRUE) %>%
  top_n(10, n)
# A tibble: 10 × 2
   dest      n
   <chr> <int>
 1 ORD   17283
 2 ATL   17215
 3 LAX   16174
 4 BOS   15508
 5 MCO   14082
 6 CLT   14064
 7 SFO   13331
 8 FLL   12055
 9 MIA   11728
10 DCA    9705

From this question, we can see the top 10 destination and how many flight went to each, and the top 1 is O’Hare international Airport located in Chicago, Illinois has 17283 flights went to that airport.

Question 2

library(dplyr)

time_bucket <- function(time) {
  hour <- floor(time / 100)  # convert HHMM to hour
  case_when(
    is.na(hour) ~ NA_character_,
    hour >= 0  & hour < 6  ~ "early morning",  # 12am–6am
    hour >= 6  & hour < 12 ~ "morning",        # 6am–12pm
    hour >= 12 & hour < 18 ~ "afternoon",      # 12pm–6pm
    hour >= 18 & hour < 24 ~ "evening"         # 6pm–12am
  )
}

flights <- flights %>%
  mutate(
    dep_period = time_bucket(dep_time),
    arr_period = time_bucket(arr_time)
  )
# check results 
table(flights$dep_period)

    afternoon early morning       evening       morning 
       120761          8730         76708        122293 
table(flights$arr_period)

    afternoon early morning       evening       morning 
       108911         11778        118718         88506 
ggplot(flights, aes(x = dep_period)) +
  geom_bar(fill = "blue") +
  labs(title = "Flights by Departure Period",
       x = "Departure Period", y = "Number of Flights")

ggplot(flights, aes(x = arr_period)) +
  geom_bar(fill = "red") +
  labs(title = "Flights by Arrival Period",
       x = "Arrival Period", y = "Number of Flights")

#red eye flights departure late at night (around 9 p.m. to 1 a.m.) and an arrival early the next morning (around 5 a.m. to 7 a.m.)
# Find red-eye flights: depart afternoon/evening & arrive early morning/morning
red_eye <- flights %>%
  filter(
    dep_period %in% c("afternoon", "evening") &
    arr_period %in% c("early morning", "morning")
  )

# Calculate percentage of red-eye flights
red_eye_percent <- nrow(red_eye) / nrow(flights) * 100
red_eye_percent
[1] 3.14868

**Base on Google search, we know that red eye flights departure late at night (around 9 p.m. to 1 a.m.) and an arrival early the next morning (around 5 a.m. to 7 a.m.), but the question defines these as flights that depart in “afternoon” or “evening” and arrive in “early morning” or “morning.” Base on that the percentage calculation is 3.14868% of the flights were “red eye” flights.

Question 3

flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarise(n_airlines = n_distinct(carrier),
            carriers = paste(sort(unique(carrier)), collapse = ", "),
            .groups = "drop") %>%
  filter(n_airlines > 1)
# A tibble: 17 × 3
   tailnum n_airlines carriers
   <chr>        <int> <chr>   
 1 N146PQ           2 9E, EV  
 2 N153PQ           2 9E, EV  
 3 N176PQ           2 9E, EV  
 4 N181PQ           2 9E, EV  
 5 N197PQ           2 9E, EV  
 6 N200PQ           2 9E, EV  
 7 N228PQ           2 9E, EV  
 8 N232PQ           2 9E, EV  
 9 N933AT           2 DL, FL  
10 N935AT           2 DL, FL  
11 N977AT           2 DL, FL  
12 N978AT           2 DL, FL  
13 N979AT           2 DL, FL  
14 N981AT           2 DL, FL  
15 N989AT           2 DL, FL  
16 N990AT           2 DL, FL  
17 N994AT           2 DL, FL  

There are 17 plans that flew for multiple airlines (2 mostly). They fly mostly for both 9E - Endeavor Air Inc. and EV - ExpressJet Airlines Inc. or DL - Delta Air Lines Inc. and FL - AirTran Airways Corporation.

Question 4 **The missing relationship is between weather\(origin and airport\)faa. In weather the origin represents three NYC main airports and shows which airport the weather data was recorded. In airports, the variable faa is the airport code, which means we can connect the weather data to airport data. There should be a line from weather to airport in the figure.

table(weather$origin)

 EWR  JFK  LGA 
8703 8706 8706 
table(airports$faa)

04G 06A 06C 06N 09J 0A9 0G6 0G7 0P2 0S9 0W3 10C 17G 19A 1A3 1B9 1C9 1CS 1G3 1G4 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
1H2 1OH 1RL 23M 24C 24J 25D 29D 2A0 2B2 2G2 2G9 2H0 2J9 369 36U 38W 3D2 3G3 3G4 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
3J1 3W2 40J 41N 47A 49A 49X 4A4 4A7 4A9 4B8 4G0 4G2 4G4 4I7 4U9 52A 54J 55J 55S 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
57C 5B2 60J 6A2 6J4 6K8 6S0 6S2 6Y8 70J 70N 7A4 7D9 7N7 8M8 93C 99N 9A1 9A5 9G1 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
A39 A50 AAF AAP ABE ABI ABL ABQ ABR ABY ACJ ACK ACT ACV ACY ADK ADM ADQ ADS ADW 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
AET AEX AFE AFW AGC AGN AGS AHN AIA AIK AIN AIZ AKB AKC AKI AKK AKN AKP ALB ALI 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
ALM ALO ALS ALW ALX ALZ AMA ANB ANC AND ANI ANN ANP ANQ ANV AOH AOO AOS APA APC 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
APF APG APN AQC ARA ARB ARC ART ARV ASE ASH AST ATK ATL ATT ATW ATY AUG AUK AUO 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
AUS AUW AVL AVO AVP AVW AVX AZA AZO BAB BAD BAF BBX BCE BCT BDE BDL BDR BEC BED 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
BEH BET BFD BFF BFI BFL BFM BFP BFT BGE BGM BGR BHB BHM BID BIF BIG BIL BIS BIV 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
BIX BJC BJI BKC BKD BKF BKG BKH BKL BKW BKX BLD BLF BLH BLI BLV BMC BMG BMI BMT 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
BMX BNA BOI BOS BOW BPT BQK BRD BRL BRO BRW BSF BTI BTM BTR BTT BTV BUF BUR BUU 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
BUY BVY BWD BWG BWI BXK BXS BYH BYS BYW BZN C02 C16 C47 C65 C89 C91 CAE CAK CAR 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
CBE CBM CCO CCR CDB CDC CDI CDK CDN CDR CDS CDV CDW CEC CEF CEM CEU CEW CEZ CFD 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
CGA CGC CGF CGI CGX CGZ CHA CHI CHO CHS CHU CIC CID CIK CIL CIU CKB CKD CKF CKV 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
CLC CLD CLE CLL CLM CLS CLT CLW CMH CMI CMX CNM CNW CNY COD COE COF COI CON COS 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
COT COU CPR CPS CRE CRP CRW CSG CTB CTH CTJ CTY CVG CVN CVO CVS CVX CWA CWI CWT 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
CXF CXL CXO CXY CYF CYM CYS CYT CZF CZG CZN DAB DAL DAW DAY DBN DBQ DCA DDC DEC 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
DEN DET DFW DGL DHB DHN DHT DIK DKB DKK DKX DLF DLG DLH DLL DMA DNL DNN DNV DOV 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
DPA DQH DRG DRI DRM DRO DRT DSM DTA DTS DTW DUC DUG DUJ DUT DVL DVT DWA DWH DWS 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
DXR DYS E25 E51 E55 E63 E91 EAA EAR EAT EAU ECA ECG ECP EDF EDW EEK EEN EET EFD 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
EGA EGE EGT EGV EGX EHM EIL EKI EKN EKO EKY ELD ELI ELM ELP ELV ELY EMK EMP ENA 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
END ENV ENW EOK EPM EQY ERI ERV ERY ESC ESD ESF ESN EUF EUG EVV EVW EWB EWK EWN 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
EWR EXI EYW F57 FAF FAI FAR FAT FAY FBG FBK FBR FBS FCA FCM FCS FDW FDY FFA FFC 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
FFO FFT FFZ FHU FIT FKL FLD FLG FLL FLO FLV FME FMH FMN FMY FNL FNR FNT FOD FOE 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
FOK FRD FRI FRN FRP FSD FSI FSM FST FTK FTW FTY FUL FWA FXE FYU FYV FZG FZI GAD 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
GAI GAL GAM GBN GCC GCK GCN GCW GDV GDW GED GEG GEU GFK GFL GGE GGG GGW GHG GIF 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
GJT GKN GKY GLD GLH GLS GLV GNT GNU GNV GON GPT GPZ GQQ GRB GRF GRI GRK GRM GRR 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
GSB GSO GSP GST GTB GTF GTR GTU GUC GUP GUS GVL GVQ GVT GWO GYY HBG HBR HCC HCR 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
HDH HDI HDN HDO HFD HGR HHH HHI HHR HIB HIF HII HIO HKB HKY HLG HLN HLR HMN HNH 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
HNL HNM HNS HOB HOM HON HOP HOT HOU HPB HPN HQM HQU HRL HRO HRT HSH HSL HST HSV 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
HTL HTS HUA HUF HUL HUS HUT HVN HVR HWD HWO HXD HYA HYG HYL HYS HZL IAB IAD IAG 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
IAH IAN ICT ICY IDA IDL IFP IGG IGM IGQ IJD IKK IKO IKR IKV ILG ILI ILM ILN IMM 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
IMT IND INJ INK INL INS INT INW IOW IPL IPT IRC IRK ISM ISN ISO ISP ISW ITH ITO 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
IWD IWS IYK IZG JAC JAN JAX JBR JCI JEF JES JFK JGC JHM JHW JKA JLN JMS JNU JOT 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
JRA JRB JST JVL JXN JYL JYO JZP K03 K27 K83 KAE KAL KBC KBW KCC KCL KCQ KEH KEK 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
KFP KGK KGX KKA KKB KKH KLG KLL KLN KLS KLW KMO KMY KNW KOA KOT KOY KOZ KPB KPC 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
KPN KPR KPV KPY KQA KSM KTB KTN KTS KUK KVC KVL KWK KWN KWP KWT KYK KYU KZB L06 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
L35 L52 LAA LAF LAL LAM LAN LAR LAS LAW LAX LBB LBE LBF LBL LBT LCH LCK LCQ LDJ 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
LEB LEW LEX LFI LFK LFT LGA LGB LGC LGU LHD LHM LHV LHX LIH LIT LIV LKE LKK LKP 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
LMT LNA LNK LNN LNR LNS LNY LOT LOU LOZ LPC LPR LPS LRD LRF LRO LRU LSE LSF LSV 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
LTS LUF LUK LUP LUR LVK LVM LVS LWA LWB LWC LWM LWS LWT LXY LYH LYU LZU M94 MAE 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MAF MBL MBS MCC MCD MCE MCF MCG MCI MCK MCL MCN MCO MCW MDT MDW ME5 MEI MEM MER 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MFD MFE MFI MFR MGC MGE MGJ MGM MGR MGW MGY MHK MHM MHR MHT MHV MIA MIB MIC MIE 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MIV MKC MKE MKG MKK MKL MKO MLB MLC MLD MLI MLJ MLL MLS MLT MLU MLY MMH MMI MMU 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MMV MNM MNT MNZ MOB MOD MOT MOU MPB MPI MPV MQB MQI MQT MRB MRI MRK MRN MRY MSL 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MSN MSO MSP MSS MSY MTC MTH MTJ MTM MTN MUE MUI MUO MVL MVY MWA MWC MWH MWL MWM 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
MXF MXY MYF MYL MYR MYU MYV MZJ N53 N69 N87 NBG NBU NCN NEL NEW NFL NGF NGP NGU 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
NGZ NHK NIB NID NIP NJK NKT NKX NLC NLG NME NMM NNL NOW NPA NPZ NQA NQI NQX NSE 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
NTD NTU NUI NUL NUP NUQ NUW NXP NXX NY9 NYC NYG NZC NZJ NZY O03 O27 OAJ OAK OAR 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
OBE OBU OCA OCF OEB OFF OGG OGS OJC OKC OLF OLH OLM OLS OLT OLV OMA OME OMN ONH 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
ONP ONT OOK OPF OQN OQU ORD ORF ORH ORI ORL ORT ORV OSC OSH OSU OTH OTS OTZ OWB 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
OWD OXC OXD OXR OZA P08 P52 PAE PAH PAM PAO PAQ PBF PBG PBI PBV PBX PCW PCZ PDB 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
PDK PDT PDX PEC PEQ PFN PGA PGD PGV PHD PHF PHK PHL PHN PHO PHX PIA PIB PIE PIH 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
PIM PIP PIR PIT PIZ PKB PLN PMB PMD PML PMP PNC PNE PNM PNS POB POC POE POF PPC 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
PPV PQI PQS PRC PRL PSC PSG PSM PSP PSX PTA PTB PTH PTK PTU PUB PUC PUW PVC PVD 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
PVU PWK PWM PWT PYM PYP R49 RAC RAL RAP RBD RBK RBM RBN RBY RCA RCE RCZ RDD RDG 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
RDM RDR RDU RDV REI RFD RHI RIC RID RIF RIL RIR RIU RIV RIW RKD RKH RKP RKS RME 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
RMG RMP RMY RND RNM RNO RNT ROA ROC ROW RSH RSJ RST RSW RUT RVS RWI RWL RYY S30 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
S40 S46 SAA SAC SAD SAF SAN SAT SAV SBA SBD SBM SBN SBO SBP SBS SBY SCC SCE SCH 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
SCK SCM SDC SDF SDM SDP SDX SDY SEA SEE SEF SEM SES SFB SFF SFM SFO SFZ SGF SGH 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
SGJ SGR SGU SGY SHD SHG SHH SHR SHV SHX SIK SIT SJC SJT SKA SKF SKK SKY SLC SLE 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
SLK SLN SLQ SMD SME SMF SMK SMN SMO SMX SNA SNP SNY SOP SOW SPB SPF SPG SPI SPS 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
SPW SPZ SQL SRQ SRR SRV SSC SSI STC STE STG STJ STK STL STS SUA SUE SUN SUS SUU 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
SUX SVA SVC SVH SVN SVW SWD SWF SXP SXQ SYA SYB SYR SZL TAL TAN TBN TCC TCL TCM 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
TCS TCT TEB TEK TEX TIK TIW TIX TKA TKE TKF TKI TLA TLH TLJ TLT TMA TMB TNC TNK 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
TNT TNX TOA TOC TOG TOL TOP TPA TPL TRI TRM TSS TTD TTN TUL TUP TUS TVC TVF TVI 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
TVL TWA TWD TWF TXK TYE TYR TYS TZR U76 UCA UDD UDG UES UGN UIN UMP UNK UPP UST 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
UT3 UTM UTO UUK UUU UVA VAD VAK VAY VBG VCT VCV VDF VDZ VEE VEL VGT VIS VLD VNW 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
VNY VOK VPC VPS VRB VSF VYS W04 W13 WAA WAL WAS WBB WBQ WBU WBW WDR WFB WFK WHD 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
WHP WIH WKK WKL WLK WMO WRB WRG WRI WRL WSD WSJ WSN WST WSX WTK WTL WWD WWP WWT 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
WYS X01 X04 X07 X21 X26 X39 X49 X59 XFL XNA XZK Y51 Y72 YAK YIP YKM YKN YNG YUM 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
Z84 ZBP ZFV ZPH ZRA ZRD ZRP ZRT ZRZ ZSF ZSY ZTF ZTY ZUN ZVE ZWI ZWU ZYP 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 

Question 5

weather <- weather %>%
  mutate(id = paste(year, month, day, hour, origin, sep = "_"))
sum(duplicated(weather$id))
[1] 3
weather %>%
  count(id) %>%
  filter(n > 1)
# A tibble: 3 × 2
  id                  n
  <chr>           <int>
1 2013_11_3_1_EWR     2
2 2013_11_3_1_JFK     2
3 2013_11_3_1_LGA     2

There are three duplicate values, and they are all in year 2013, November third 1 AM in the morning from all three main airport in NYC. Daylight saving time change is the reason because clocks were turned back one hour, creating two weather records for the same local hour at each airport. As confirmed on Google, that 2013’s day light saving end clock backward is on Sunday, November 3, 2 AM.

Question 6

#merge the flight and eather data together by origin and time_hour
flight_merged <-flights %>%
  left_join(
    weather %>% select(origin, time_hour, temp:visib),  # all weather vars
    by = c("origin", "time_hour")
  )

nrow(flight_merged) == nrow(flights)   # TRUE = same number of flights with added weather information 
[1] TRUE
#EDA checklist 
dim(flight_merged)
[1] 336776     30
head(flight_merged)
# A tibble: 6 × 30
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 22 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_period <chr>,
#   arr_period <chr>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>,
#   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>
tail(flight_merged)
# A tibble: 6 × 30
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1842        NA       NA           2019
2  2013     9    30       NA           1455        NA       NA           1634
3  2013     9    30       NA           2200        NA       NA           2312
4  2013     9    30       NA           1210        NA       NA           1330
5  2013     9    30       NA           1159        NA       NA           1344
6  2013     9    30       NA            840        NA       NA           1020
# ℹ 22 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_period <chr>,
#   arr_period <chr>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>,
#   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>
str(flight_merged)
tibble [336,776 × 30] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
 $ dep_period    : chr [1:336776] "early morning" "early morning" "early morning" "early morning" ...
 $ arr_period    : chr [1:336776] "morning" "morning" "morning" "morning" ...
 $ temp          : num [1:336776] 39 39.9 39 39 39.9 ...
 $ dewp          : num [1:336776] 28 25 27 27 25 ...
 $ humid         : num [1:336776] 64.4 54.8 61.6 61.6 54.8 ...
 $ wind_dir      : num [1:336776] 260 250 260 260 260 260 240 260 260 260 ...
 $ wind_speed    : num [1:336776] 12.7 15 15 15 16.1 ...
 $ wind_gust     : num [1:336776] NA 21.9 NA NA 23 ...
 $ precip        : num [1:336776] 0 0 0 0 0 0 0 0 0 0 ...
 $ pressure      : num [1:336776] 1012 1011 1012 1012 1012 ...
 $ visib         : num [1:336776] 10 10 10 10 10 10 10 10 10 10 ...
summary(flight_merged)
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance         hour      
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                    dep_period       
 Min.   : 0.00   Min.   :2013-01-01 05:00:00   Length:336776     
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00   Class :character  
 Median :29.00   Median :2013-07-03 10:00:00   Mode  :character  
 Mean   :26.23   Mean   :2013-07-03 05:22:54                     
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00                     
 Max.   :59.00   Max.   :2013-12-31 23:00:00                     
                                                                 
  arr_period             temp             dewp           humid       
 Length:336776      Min.   : 10.94   Min.   :-9.94   Min.   : 12.74  
 Class :character   1st Qu.: 42.08   1st Qu.:26.06   1st Qu.: 43.99  
 Mode  :character   Median : 57.20   Median :42.80   Median : 57.73  
                    Mean   : 57.00   Mean   :41.63   Mean   : 59.56  
                    3rd Qu.: 71.96   3rd Qu.:57.92   3rd Qu.: 75.33  
                    Max.   :100.04   Max.   :78.08   Max.   :100.00  
                    NA's   :1573     NA's   :1573    NA's   :1573    
    wind_dir       wind_speed       wind_gust          precip       
 Min.   :  0.0   Min.   : 0.000   Min.   :16.11    Min.   :0.00000  
 1st Qu.:130.0   1st Qu.: 6.905   1st Qu.:20.71    1st Qu.:0.00000  
 Median :220.0   Median :10.357   Median :24.17    Median :0.00000  
 Mean   :201.5   Mean   :11.114   Mean   :25.25    Mean   :0.00456  
 3rd Qu.:290.0   3rd Qu.:14.960   3rd Qu.:28.77    3rd Qu.:0.00000  
 Max.   :360.0   Max.   :42.579   Max.   :66.75    Max.   :1.21000  
 NA's   :9796    NA's   :1634     NA's   :256391   NA's   :1556     
    pressure          visib       
 Min.   : 983.8   Min.   : 0.000  
 1st Qu.:1012.7   1st Qu.:10.000  
 Median :1017.5   Median :10.000  
 Mean   :1017.8   Mean   : 9.256  
 3rd Qu.:1022.8   3rd Qu.:10.000  
 Max.   :1042.1   Max.   :10.000  
 NA's   :38788    NA's   :1556    

335776 rows and 30 columns or variables. The data is all from year 2013 the months are a little bit out of order because the tail show September as the last few rows while the data set has December. I then checked all the variables in the merged data set, using summary function to look for Na’s and any extreme or unreasonable numbers that should be filtered as Na’s.

Question 7

# Average departure delay per day Group by (year, month, day)
avg_delay_day <- flights %>%
  group_by(year, month, day) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% #calc the average of the departure delay
  arrange(desc(avg_dep_delay)) # arrange in descending order
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# Show the worst day
head(avg_delay_day, 1)
# A tibble: 1 × 4
# Groups:   year, month [1]
   year month   day avg_dep_delay
  <int> <int> <int>         <dbl>
1  2013     3     8          83.5
# Average departure delay by airport and day 
avg_delay_day_origin <- flights %>%
  group_by(origin, year, month, day) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))
`summarise()` has grouped output by 'origin', 'year', 'month'. You can override
using the `.groups` argument.
head(avg_delay_day_origin, 1)
# A tibble: 1 × 5
# Groups:   origin, year, month [1]
  origin  year month   day avg_dep_delay
  <chr>  <int> <int> <int>         <dbl>
1 LGA     2013     3     8          106.
# Average delay by hour and origin
avg_delay_hour_origin <- flights %>%
  group_by(origin, year, month, day, hour) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))
`summarise()` has grouped output by 'origin', 'year', 'month', 'day'. You can
override using the `.groups` argument.
head(avg_delay_hour_origin, 1)
# A tibble: 1 × 6
# Groups:   origin, year, month, day [1]
  origin  year month   day  hour avg_dep_delay
  <chr>  <int> <int> <int> <dbl>         <dbl>
1 LGA     2013     7    28    21          280.

1)The worst average length of delays for departures are 2013/03/08 with an aberage delay of 83.53692 minutes. 2) LGA has the worst single day for delays and it is 2013.03.08 with an average of 105.7249 minutes. 3) LGA at 9 PM had the worst single hour for delay on 2013/07/28 for an average delay of 279.6667 minutes.

Question 8

avg_by_dest <- flights %>%
  group_by(dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            n = n(), .groups = "drop")

airports_merged <- airports %>%
  right_join(avg_by_dest, by = c("faa" = "dest")) # Keep all rows from avg_by_dest + Keeps only destinations that appear in flights
# Map 
us <- map_data("state")

ggplot() +
  geom_polygon( #draw state boundaries 
    data = us, #from us <- map_data("state")
    aes(x=long, y=lat, group = group), 
    fill = "white", color = "black", linewidth = 0.2 # create a grey map of the U.S. as background layer
  ) +
  geom_point(
    data = airports_merged, #add info of the airports
    aes(x = lon, y = lat, color = avg_dep_delay, size = n), # size = n is how many flights were recorded at that airport + color shows average departure delays (darker = worse)
    alpha = 0.8
  ) +  
  scale_color_gradient(
  low = "blue",
  high = "red",
  name = "Avg departure delay (min)"
)+
  labs(
    title = "Average Departure Delay by Destination Airport (2013)",
      x = "Longitude", 
    y = "Latitude"
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

Airports located in the East Coast and Midwest show generally higher average delays, whereas many West Coast airports exhibit moderate delays. Airports with a larger number of flights tend to experience moderate but consistent delays, likely due to heavy air traffic and congestion. In contrast, smaller or regional airports usually have lower average delays because of reduced flight volume. A few small airports in the central U.S. display high average delays despite having few flights—this pattern may be linked to severe or stormy weather conditions that cause irregular disruptions rather than congestion-related delays.There are more significant delays in the eastern U.S. compared to the West. This could be due to the higher concentration of airports and denser air traffic in the eastern region, leading to more congestion.

# Summary
flight_merged %>%
  select(dep_delay, temp, wind_speed, wind_gust, precip, visib, pressure) %>%
  summary()
   dep_delay            temp          wind_speed       wind_gust     
 Min.   : -43.00   Min.   : 10.94   Min.   : 0.000   Min.   :16.11   
 1st Qu.:  -5.00   1st Qu.: 42.08   1st Qu.: 6.905   1st Qu.:20.71   
 Median :  -2.00   Median : 57.20   Median :10.357   Median :24.17   
 Mean   :  12.64   Mean   : 57.00   Mean   :11.114   Mean   :25.25   
 3rd Qu.:  11.00   3rd Qu.: 71.96   3rd Qu.:14.960   3rd Qu.:28.77   
 Max.   :1301.00   Max.   :100.04   Max.   :42.579   Max.   :66.75   
 NA's   :8255      NA's   :1573     NA's   :1634     NA's   :256391  
     precip            visib           pressure     
 Min.   :0.00000   Min.   : 0.000   Min.   : 983.8  
 1st Qu.:0.00000   1st Qu.:10.000   1st Qu.:1012.7  
 Median :0.00000   Median :10.000   Median :1017.5  
 Mean   :0.00456   Mean   : 9.256   Mean   :1017.8  
 3rd Qu.:0.00000   3rd Qu.:10.000   3rd Qu.:1022.8  
 Max.   :1.21000   Max.   :10.000   Max.   :1042.1  
 NA's   :1556      NA's   :1556     NA's   :38788   
library(ggplot2)
layout(matrix(1:2, nrow = 3))
Warning in matrix(1:2, nrow = 3): data length [2] is not a sub-multiple or
multiple of the number of rows [3]
# Relationship between temp and departure delay
ggplot(flight_merged, aes(x = temp, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Temperature", x = "Temperature", y = "Departure Delay (min)")
Warning: Removed 9800 rows containing missing values or values outside the scale range
(`geom_point()`).

# Relationship between pressure and departure delay
ggplot(flight_merged, aes(x = pressure, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Pressure", x = "Pressure", y = "Departure Delay (min)")
Warning: Removed 44574 rows containing missing values or values outside the scale range
(`geom_point()`).

# Relationship between wind speed and departure delay
ggplot(flight_merged, aes(x = wind_speed, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Wind Speed", x = "Wind Speed (mph)", y = "Departure Delay (min)")
Warning: Removed 9861 rows containing missing values or values outside the scale range
(`geom_point()`).

# Relationship between precipitation and departure delay
ggplot(flight_merged, aes(x = precip, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Precipitation", x = "Precipitation", y = "Departure Delay (min)")
Warning: Removed 9783 rows containing missing values or values outside the scale range
(`geom_point()`).

# Relationship between wind gust and departure delay
ggplot(flight_merged, aes(x = wind_gust, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Wind Gust ", x = "Wind Gust", y = "Departure Delay (min)")
Warning: Removed 259042 rows containing missing values or values outside the scale range
(`geom_point()`).

# Relationship between visibility  and departure delay
ggplot(flight_merged, aes(x = visib, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Visibility", x = "Visibility", y = "Departure Delay (min)")
Warning: Removed 9783 rows containing missing values or values outside the scale range
(`geom_point()`).

layout(1)

PRIMARY QUESTION: Which weather phenomena have the most impact on flight delays?

ANSWER The median of depature delay is -2 and mean is 12.64, and there are 8255 missing data. On 2013/03/08, the worst delay happened and the weather that day is cloudy and rainy (https://www.timeanddate.com/weather/usa/new-york/historic?month=3&year=2013). Using ggplot to examine the relationship between flight delayed and weather condition. The relationship between temperature and departure delay is not very strong or consistent, but some patterns can still be interpreted. Flights experience delays across a wide range of temperatures, although delays appear slightly less common at the extreme ends (near 0°F or 100°F). This may occur because flights are more likely to be canceled rather than delayed when temperatures become dangerously low or high, due to equipment and safety concerns.

Air pressure shows a clearer weather-related pattern. Lower pressure indicates a higher likelihood of storms or unstable atmospheric conditions. There may be fewer recorded delays because many flights are canceled entirely for safety reasons. Meanwhile, higher pressure corresponds to clearer skies and more stable weather, which is typically associated with fewer and shorter delays. For precipitation, there is no very clear relationship with departure delay. However, from an operational standpoint, heavy rain or freezing rain can lead to flight cancellations, while light or no precipitation usually indicates favorable flying conditions. One possible reason why low-precipitation days still show delays could be air traffic congestion — when weather is good, more flights are scheduled, increasing the potential for traffic-related delays rather than weather-related ones.

Regarding wind speed, it may seem counter intuitive, but very low wind speeds are not always ideal for flight operations. Airplanes require a certain amount of headwind — typically around 40–46 mph (35–40 knots) — to assist with takeoff and landing. When wind speeds are too calm, aircraft may need to wait for optimal conditions, which can lead to departure delays. There is no clear relationship between wind gust and departure delay. While stronger wind gusts can lead to flight cancellations for safety reasons. For visibility, as visibility increases, departure delays also tend to rise slightly. This could occur because clear-weather days attract higher flight traffic, leading to air traffic congestions. In contrast, when visibility is extremely low (close to 0 miles), flights are often canceled instead of delayed. Moderate visibility (around 3–6 miles) may cause some delays, as operations slow down to maintain safety.